This paper proposes a framework for adapting to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply, for every input frame, the feature transform that best compensates for the background. The implementation uses a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. As a result, the proposed adaptation requires no ground truth or prior knowledge of the background in each frame, since it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of both background effects and speaker. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark in which to evaluate the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.